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Motivation: Similarity-measure based clustering is a crucial problem appearing throughout scien- 
tific data analysis. Recently, a powerful new algorithm called Affinity Propagation (AP) based on 
message-passing techniques was proposed by Frey and Dueck In AP, each cluster is identified by 
a common exemplar all other data points of the same cluster refer to, and exemplars have to refer 
to themselves. Albeit its proved power, AP in its present form suffers from a number of drawbacks. 
. The hard constraint of having exactly one exemplar per cluster restricts AP to classes of regularly 

' shaped clusters, and leads to suboptimal performance, e.g., in analyzing gene expression data. 

' Results: This limitation can be overcome by relaxing the AP hard constraints. A new parameter 

. controls the importance of the constraints compared to the aim of maximizing the overall similarity, 

^ ' and allows to interpolate between the simple case where each data point selects its closest neighbor 

Q ' as an exemplar and the original AP. The resulting soft-constraint affinity propagation (SCAP) 

, becomes more informative, accurate and leads to more stable clustering. Even though a new a 

. priori free-parameter is introduced, the overall dependence of the algorithm on external tuning is 

^\ ' reduced, as robustness is increased and an optimal strategy for parameter selection emerges more 

, naturally. SCAP is tested on biological benchmark data, including in particular microarray data 

related to various cancer types. We show that the algorithm efficiently unveils the hierarchical 
cluster structure present in the data sets. Further on, it allows to extract sparse gene expression 
signatures for each cluster. 
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I. INTRODUCTION 



Clustering based on a measure of similarity is a crucial problem which appears throughout scientific data analysis. 
For an overview see !3|. Recently, a powerful algorithm called Affinity Propagation (AP) based on message passing 
04 was proposed by Frey and Dueck [J]. As reported impressively in the original publication, this algorithm achieves 
^ ' a considerable improvement over standard clustering methods like K-means spectral clustering [5] and super- 
^2 ■ paramagnetic clustering [1, 0| • 

Based on an ad hoc pairwise similarity function between data points, AP seeks to identify each cluster by one of its 
elements, the so-called exemplar. Each point in the cluster refers to this exemplar, each exemplar is required to refer 
to itself as a self-exemplar. This hard constraint forces clusters to appear as stars of radius one: There is only one 
central node, and all other nodes are directly connected to it. Subject to this constraint, AP seeks at maximizing the 
overall similarity of all data points to their exemplars. The solution to this hard combinatorial task is approximated 
following the ideas of belief-propagation ^10*1 . One of the important points of AP is its computational efficiency: 
whereas a naive implementation of belief propagation for N data points leads to 0{N^) messages which have to be 
determined self-consistently, the elegant formulation of Frey and Dueck allows to work with O(iV^) messages only. 
Therefore the algorithm is feasible even in the presence of very large data sets. 

Albeit its impressive power in a wide range of applications [1], AP in its present form suffers from a number of 
5^ , drawbacks. The most important ones related to the present work are: 

• The hard constraint in AP relies strongly on cluster-shape regularity: Elongated or irregular multi-dimensional 
data might have more than one simple cluster center. AP may force division of single clusters into separate 
ones. 

• Since all data points in a cluster must point to the same exemplar, all information about both the internal 
structure and the hierarchical merging /dissociation of clusters is lost. 

• AP has robustness limitations: A small perturbation of similarities may influence the choice of one or few 
exemplars, and the hard constraint may trigger an avalanche leading to a different partitioning of the data set 
into clusters. This point is particularly important in the presence of noise in the data as, e.g., in microarray 
measurements. 



• AP forces each exemplar to point to itself. A relaxation of the hard constraint may allow for cluster structures 
including second- or higher-order pointing processes. 
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These problems may be solved by modifying the original optimization task of AP. As a first step we relax the hard 
constraint by introducing a finite penalty term for each constraint violation. This softening can be chosen in a way 
that the computational complexity of the algorithm remains unchanged, but its performance on biological test sets 
is improved considerably. Moreover, relaxing the constraint helps in gaining valuable insight into the hierarchical 
structure of the clustering, increasing result robustness at the same time. By tuning the cluster number we see the 
merging of two clusters into a single one, or the dissociation of single clusters into two separated ones. 



II. THE ALGORITHM 



Given a set D = {x^j^^i of N data points, the original algorithm of Frey and Dueck takes as input a collection 
of real valued similarities S{^, v) between the pairs x^,x„, e {1, ...,N}. The choice of the similarity measure 
is heuristic, it depends on the nature of data to be clustered. In the case of high-dimensional data as present in 
gene-expression analysis, similarity may be measured by the Pearson correlation coefficient or the negative pairwise 
Euclidean distance. However, for the algorithm described below it is not even necessary that the similarities are 
symmetric. 

The algorithm searches for a mapping c : {1, A^} i-^ {1, A} which maps each data point ^ to its exemplar 
which itself is a data point. This mapping shall minimize the cost function (or energy) Ei[c\ — — X]^=i ^ip-i c^i) 
which equals minus the overall similarity of the data points to their exemplars. In the original AP algorithm c is 
restricted by A^ hard constraints: Whenever a data point is selected as an exemplar by another data point, it has to 
be its own self- exemplar. 

In this setting, we need to specify self-similarities S'(/i,/i). They describe the availability of data points for being 
self-exemplars (and thus to serve as a cluster center) . Since all data points are a priori equally suitable to play such 
a role, one is naturally led to choose all self-similarities to have some common value 5(/i, /i) = —a, fi = 1, A^. The 
free model parameter a acts like a chemical potential in statistical physics, setting the prior likelihood of the number 
of self exemplars (and of separated clusters consequently). For very small value of a, every data point prefers to be 
its own exemplar, and the number of clusters equals the number of data points. In the opposite extreme case of large 
a, self-exemplars have high cost in £'i[c]. All data points are collected in one large cluster with a single exemplar. 
For intermediate values a acts as a tuning parameter for the cluster number which decreases monotonously with a. 
Frey and Dueck argue that, if the data set has some underlying structure, the correct clustering can be identified by 
a comparably broad range of values of the self-similarity in which the inferred cluster structure does not change. If 
data are not sparse and clusters are symmetrically shaped, then affinity propagation works very well and produces 
the correct clustering in a very short time. 

Finding the cost minimum of Ei [c] subject to the self-exemplar constraint is a computationally hard task. It can 
be achieved exactly only for very small systems. The central idea of AP is therefore to identify the exemplars using 
message passing (belief propagation, BP) as a heuristic strategy Q: Real- valued messages between pairs of data 
points are updated recursively until a stable clustering emerges. The original AP equations are a direct application 
of BP (or, equivalently, Max-Sum [l^) to the clustering problem. 

There are two types of messages exchanged between data points The responsibility r{^,i') is sent from data 
point fi to candidate exemplar it reflects the accumulated evidence that /i chooses ly as its cluster exemplar. The 
availability a{fi, v) is sent from candidate exemplar v to datum /i; it reflects the appropriateness for v to be an 
exemplar for /i as a result of the self-exemplar constraint. As mentioned before, the original AP imposes constraints 
on exemplars to be self-exemplars. We modify the algorithm of Frey and Dueck by softening this hard constraint. We 
write the constraint attached to a given data point fi as follows, with p e [0, 1]: 

The first case assigns a penalty p if data point /i is chosen as an exemplar by some other data point i/, without being 
a self-exemplar. The hard-constraint limit of Frey and Dueck is recovered by setting p to zero. For p — \, x^f^lc] 
becomes identically one, the minimization task of Ei[c\ becomes unconstrained and independent for all data points, 
thus each data point chooses his nearest neighbor as an exemplar. An intermediate value of p allows to interpolate 
between these two extreme cases. It presents a compromise between the minimization of Ei[c] on one hand, and the 
search for compact clusters on the other hand. 

Finally we introduce a positive real-valued parameter /3 weighing the relative importance of the cost minimization 
with respect to the constraints. In a statistical-physics perspective, this parameter can be seen as a formal inverse 
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temperature. Its introduction allows us to define the probability of an arbitrary clustering c as 

F[c] = iexp(-/3i?i[c]) n Xlf)[c] (2) 

where the partition function Z serves to normalize P[c]. In this setting, both clusterings of non-optimal cost and 
of many violated self-exemplar constraints are suppressed exponentially. The task of finding high-scoring c can be 
understood as a minimization problem with the modified cost function 

N N 

AP is recovered by taking p = since any violated constraint sets P[c] to zero in Eq. ([2]). For general p, the optimal 
clustering c* can be determined component-wise by maximizing the marginal probabilities, 

c* = argmax^^Pp(cp) = argmax^^ E 

for all data points /x. 



A. The SCAP equations 



In the limit /3 ^ cxd, Eq. ([2]) becomes concentrated to the true cost minima. Looking at Eq. ([3]) it becomes obvious 
that p has to scale as p oc e~^P in order to have some non-trivial effect. In this limit, we find the SCAP equations 
(with \Ji,^ V, see Sec. ITV|) : 



r[[i,[i) max[-p, S'(^,^) -maxA5^^{5(^, A) -f-a(^. A)}] 
a(/i, J/) — minj^O, r(y^ v) -f max(0, ^(A, z^) 



(5) 



a{fj,, fi) = minj^p, max{0, r(A, /i)} 



These 2iV^ equations are closed and can be solved iteratively. Following Eq. ^ the exemplar c* of any data point /i 
can be computed by maximizing the marginal a posteriori probability: 



c* = argmax^ [ a{fi, v) + r(/i, v) ] 



(6) 



Compared to original AP, SCAP amounts to an additional threshold on the self-availabilities a{^,fj,) and the self- 
responsibilities r(fi, For small enough p, a{fi, ^) p in many cases, up to p = 1 (or p — 0), where every site 
chooses its best first neighbor as its exemplar. At the same time, beyond a certain threshold the self responsibility 
r(/i,/j,) is substituted with —p. For p ^ oo {i.e. p — 0) the original AP equations are recovered. 

In practice, this means that variables are discouraged to be self exemplars beyond a give threshold, even in the case 
someone is already pointing at them. The resulting clustering is more stable and obviously allows for a hierarchical 
structure where A can point to fi that can point to v etc. Also loops are possible. In most of the tests performed 
(both on artificial and biological cancer data) clusters are almost tree-like besides a dimer. 



B. Efficient implementation of the algorithm 

The iterative solution of Eqs. ^ can be implemented in the following way: 

1. Define the similarity S{ij.,v) for each set of data points. Choose the values of the self-similarity a and of the 
constraint strength p. Initialize all a(/i, v) = r{ji, i^) = 

2. For all fi G {!,... ,iV}, first update the N responsibilities r{pL,v) and then the N availabilities a(v,fi), using 
Eqs. dS]). 
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3. Identify the exemplars by looking at the maximum value of r(/i, v) +a{p, v) for given /i, according to Eq. ([6]). 

4. Repeat steps 2-3 till there is no change in exemplars for a large number of iterations (we used 10-100 iterations). 
If not converged after Tmax iterations (typically 100-1000), stop the algorithm. 

Three notes are necessary at this point: 

• Step 3 is formulated as a sequential update: For each data point /i, all outgoing responsibilities and then all 
incoming availabilities are updated before moving to the next data point. In numerical experiments this was 
found to converge faster and in a larger parameter range than the damped parallel update suggested by Frey 
and Ducck in 1]. Dependence of the result on initial conditions was not observed. 

• The naive implementation of the update equations ([5]) requires 2N'^ updates, each one of computational 
complexity 0{N). A factor N can be gained by first computing the unrestricted max and sum once for a given 
/i, and then implying the restriction only inside the internal loop over v. Like this, the total complexity of a 
global update is 0{N'^) and thus feasible even for very large data sets. 

• Belief propagation on loopy graphs is not guaranteed to converge. We observe, however, efficient convergence 
of the sequential update over wide parameter ranges. To handle the possibility of non-convergence, we have 
introduced a cutoff in the number of iterations. If this is reached, the algorithm stops, and the actual parameter 
combination is discarded. 



C. Extracting cluster signatures 



In many clustering tasks input data consist of high-dimcnsional vectors, a specific example being genome-wide 
microarrays. Frequently only few components of these vectors carry useful information about the cluster structure, 
extracting such cluster signatures is of crucial importance in understanding the mechanisms behind the cluster struc- 
ture. 

In the following, we will use the specific case of microarray data. Therefore we use the notion gene for a component 
of the input vector, even if at this stage the discussion is still general. The total number of genes is denoted by 
G. We propose a simple measure of the infiuence of single genes on the total similarity measure of a cluster, as 
compared to random choices of the exemplar selection c. For simplicity, we assume the similarity between data points 
Xfj, = {x}^, ■■■,x^) and Xi, — {x^, to be additive in single-gene contributions 



(7) 



This is true, e.g., for the Pearson correlation or the negative square Euclidean distance. It can be easily generalized 
to similarity measures which are given by a monotonous function of a sum over gene contributions (like the negative 
of the Euclidean distance which is the square root of the sum of single-gene contributions). 

Having found a clustering given by the exemplar selection c, we can calculate the similarity of a cluster C defined 
as a connected component of the directed graph given by c. It is given by 



5(C) = ^5XC) , 



(8) 



as a sum over single-gene contributions 



These have to be compared to random exemplar choices which are characterized by their mean 

1 ^ 



N- 



and variance 



( N r N 



■ N 



(9) 



(10) 



(11) 
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The relevance of a gene can now be ranked according to 

A5^(C) ^^^^ 

which measures the distance of the actual S'*(C) from the distribution of random exemplar mappings. Genes can be 
ranked according to the value of li (C) , highest-ranking genes are considered a cluster signature. The same procedure 
can be carried through for each cluster independently, but also for cluster combinations. 



III. APPLICATION TO BIOLOGICAL DATA 



A. Iris data 



The data consist of measurements of sepal length, sepal width, petal length and petal width, performed for 150 
flowers, chosen from three species of the flower Iris. It is a benchmark problem for clustering Q. Super-paramagnetic 
clustering is able to cluster 125 of the data points correctly, leaving 25 points unclustered [y]. 

When we apply AP on Iris data, we identify three clusters making 16 errors. With SCAP, we identify them with 
just nine errors. We use the Manhattan distance measure for the similarity function, i.e S{^, v) = — I^^Ji ~ ^M- 

We saw that the species Iris Setosa separates without any errors. On increasing the value of p, the Iris Setosa 
cluster stays intact and the clusters for Versicolor and Virginica merge with each other, reflecting the fact that they 
are closer to each other than to Setosa. The errors occur because some samples from these species were closer to 
samples from other species than to their own. 



B. Brain cancer data 



We used a test data set monitoring the expression levels of more than 7000 genes for 42 patients, which were 
previously correctly classified into 5 diagnosis types by an a poster^or^ assessment method jPomerov e.t al 2002^ 



(10 meduUoblastoma, 10 malignant glioma, 10 atypical teratoid/rhabdoid tumors, 4 normal cerebella, 8 primitive 
neuroectodermal tumors). Each array was filtered, log-normalized to mean zero and variance one, resulting in G = 
6010 genes. Due to this choice Pearson correlation and negative square Euclidean distance are equivalent. The 
diagnosis information was not used during clustering, but only for checking the algorithmic outcome. 

Imposing five clusters in AP and SCAP: Since we knew that the correct clustering was to identify five different 
patterns, our first approach was to tune a and p in order to get five clusters. First, we fixed p to infinity (original 
AP) and changed a finding around a ^ 120 the desired number of 5 clusters with 8 errors. The error was calculated 
a posteriori by counting every data point which referring to an exemplar of a different diagnosis. Next we fixed cr to 
a sufficiently large value (the result becomes insensitive on a once the latter takes large values) , and we changed p. 
In this case, for p = 12 we got 6 clusters with 8 errors, for p — IQ A clusters with again 8 errors. 5 clusters were not 
found to correspond to any extended p-region. Note that in both cases all errors occur in the last cluster: samples 
supposed to take diagnosis 5 (PNET) rarely find an exemplar of the same class. Instead they distribute over the other 
four diagnoses. 

Clustering with AP: Then, instead of fixing the number of clusters, we changed a continuously for p oo. We 
counted the number of clusters and of errors as a function of a, see Fig. [TJ The algorithm ground state (configuration 
of maximum marginals values) in the limit of cr ^ cx) is a single cluster. 

The first non trivial clustering occurs when the number of clusters remain unchanged for a stable range of a values. 
In this preliminary study, we took that to be the actual predicted data clustering. Hence, by looking at Fig. [U 
we would conclude that there are three well-distinguishable clusters in the present data set. Look, however, to the 
number of errors: It is found to be 14-15 in this range, basically due to the wrong assignment of two entire classes to 
only three exemplars. Four or five clusters can be imposed and lead to lower error values, but require fine-tuning of 
a. 

Clustering with SCAP: We than fix a to be very large and change only p. For p — we start with seven clusters, 
this number decreases rapidly as p increases, see Fig.O As before, the point at which the number of clusters is robust 
against changes in p was taken as the best SCAP clustering. From Fig. [5] we conclude that SCAP identifies 4 clusters. 
The number of errors in classification is 8. 

Right from p — 0, where each data point chooses its closest neighbor as his exemplar, errors are due to misclas- 
sifications of the fifth diagnosis (PNET). The other data points select exemplars of the same diagnosis, but various 
clusters of same diagnosis exist. Only in the case of four clusters, as shown in Fig. [3l each of the first four diagnoses 
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FIG. 1; Plot of number of clusters as a function of self-affinity S{jj,, /i) = —a, for p = oo (the original AP algorithm). Based on 
this we would conclude that the data has three nontrivial clusters. 




FIG. 2: Plot of number of cluster and errors of SCAP as a function of cutoff parameter p. This plot suggests that the data has 
four clusters. 



is assembled in an isolated cluster, with the PNET arrays distributed over the three cancer-related clusters. The 
normal tissue (30-33 in the figure) is well-separated from all others. Only if we go towards three clusters, it merges 
with diagnosis type (0-9) (meduUoblastoma), showing that these two are closer in expression in between them than 
compared to others. A more detailed analysis of the brain cancer data is provided in the supplementary material. 

Note that SCAP also provides information about the internal organization inside the clusters. We find, e.g., that the 
misclassified patterns are always peripherical cluster elements. No other data point refers to them. This information 
is lost in AP. Due to the hard constraint all points belonging to the same cluster refer to the same exemplar, and 
information about the internal cluster structure is not contained in c*. A graphical representation of the cluster 
struc ture in this case is co ntained in the supplementary material. 

In [Po merov et al. 2002t data were clusterized using hierarchical clustering. Even if the overall cluster structure is 
similar to the one we found, there is no clear-cut clustering into 4-5 classes, some arrays (well-clustered with SCAP) 
were only added at very late stages of hierarchical clustering. The global nature of SCAP leads to a better clustering 
performance than the local and greedy hierarchical clustering. 

Anothe r interesting po int comes from the comparison of our clustering results with the supervised classification 
results of [Dettling 2004 1. There, a number of state-of-the-art classification algorithms is applied, with training sets 
containing 2/3, test sets containing 1/3 of the data points. Dettling finds that the minimal generalization error made 
is 23.8%, corresponding to ca. 10 errors on a data set of cardinality 42. It is int eresting to not e that SCAP in the 
clustering corresponding to 4 clusters makes only 8 errors. Note that training in [Dettling 2004| is done on a subset 
of patterns, but supervision in this case seems to add no valuable information to the unsupervised clustering results. 

Last but not least, we use the procedure described above to extract cluster signatures in the most stable case of 
four clusters depicted in Fig. [31 The lists of the highest ranking genes together with their relevance value Ii[C) is 
given in the supplementary material. The number of statistically relavant genes (we consider a threshold Ii{C) — 3) 
depends on the cluster and is largest for the normal tissue (42 genes), it is much smaller in particular for the first 
cluster (ca. 4 genes). If we take the first 15-25 genes per cluster, i.e., an overall signature of 60-100 genes, we already 
find basically the same clustering as before, only two new errors of previously well-assigned patterns appear. At gene 
signatures 120-240, only one of these errors survives. We therefore find that the signature found in this way carries 
most of the information needed for the clustering. Note also, that due to the fact that in an unsupervised way we 
did not separate the fifth diagnosis type into a single cluster, we do not have by definition a cluster signature for this 
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cancer type. 



C. Other benchmark cancer data 



Lymphoma cancer data: We used a data set of 62 patients for 4026 genes, showing 3 different diagnosis 
Alizadeh et al. 200Cll |. In the hmit of p going to infinity, we find the first nontrivial clustering for a between 150 — 250. 



In this regime AP group data into 3 sets, making with 3 error. For very high a and varying p, the 3-groups clustering 
becomes more stable and robust, while the algorithm makes just one assignment prediction error. In this case, Det- 
tling finds a minimal generalization error of 0.95%, corresponding to less than one error in 62 patterns. Supervision 
adds some information, even if clustering itself makes only one error. 

SRBCT cancer data: This set has 63 samples with 2308 genes and 4 expression diagnosis patterns |Kahn et al. 200il |. 
For p going to infinity, the best tuning-robust estimates groups cluster data into 5 clusters making as many as 22 
errors. On the other hand, with finite p, SCAP finds a regime of 4 clusters, making only 7 assignment errors. Here, 
Dettling reports only 1.24% generalization error in supervised classification, corresponding to less than one error on 
63 patterns. Classification thus performs considerably better than clust ering alone. 

Leukemia: This set has 72 samples with 3571 genes and 2 diagnoses [Golub et al. 1999f . In the case of infinite p, 
the original AP groups data into 2 clusters with 4 errors, while for variable p (fixing a very large) modified AP finds 
2 clusters with 2 errors. Also classification leads to 2.5% of errors, a result which is slightly better than our clustering 
result. 



IV. METHODS 



In the process of choosing exemplars, we need to calculate marginals 

P^{c) is the probability that data point fi chooses point c as its exemplar. The calculation of marginals can be 
done iteratively via a message-passing algorithm called Belief Propagation (BP) 0. It is exact on tree factor graphs 
but usable heuristically in the general case. Together with a generalized larger family of message-passing algo- 
rithms, it was shown to be very powerful i n solving NP-hard combinatorial problems on locally tree-like structures 
[Mezard et al. 200^ iHartmann et al. 2005j | . Recently, the applicability of BP was also shown to be efficient in som e 
important problems giving rise to dense and loopy factor graphs ^Kabashima 20031 . iBraunstein and Zecchina 2006l | . 

Looking at figures (g]) and (O, BP computes beliefs P^^{c) for the marginal probabilities as products of messages 
Ax^^{c) coming from each compatibility constraint, times the local prior computed as the exponential of the similarity 
between point fx and its putative exemplar c. Up to overall normalization, we write: 

P^''{c)^'[[Ax^,{c)e^'^^'^^ (14) 

A 

where /3 plays the role of an annealing parameter measuring the relative importance given to the priors compared to 
the information passed by the messages. Message Ax^fj_{c) can be interpreted, as the probability that constraint A 
alone forces fj, to select exemplar c. It can be calculated via the following self-consistent equations 

{ca} A^^i/ 

S^_,(c) « [] AA^^(c)exp(/35(^,c)) (16) 
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where the iV^ functions B^^^{c) can be seen as probabilities that data point ^ chose c to be its exemplar if constraint 
A were absent in Eq. These probabilities are called cavity probabilities, because the disregarding of one data point 
/ constraint effectively carves a cavity in the original factor graph. The link direction of functions As and Bs is shown 
if fig. ^ together with the problem's factor graph. Fig. ([5]) shows a pictorial representation of the flow of messages 
(HSl) and (Ull). 

Along the lines of [1] and [2], but bearing in mind the modified form for the compatibility constraints, eq. ()15|) can 
be simplified after a few manipulations in the following way, depending on cases: 



{^.^v) A {c^ n) A^^p(c ■.c^ii) = — [p - 



{l-p)\{{l-B^^,{^,))\ (17) 



[^^v) A (c = ^) ^ = [p + (1 - p)B^^^{^l))] 



1 

1 



(fi^iy) A (c^fi) A^^^{c:c^^i) = —^[p+{l-p) 



*((B^_.^(A*)) + n(l-5A^M(A*)))] 

with Z-^_^^ being normalization constants. 

It is remarkable that the number of effectively independent quantities present in Eqs. is much smaller than 
the apparent 0{N^) real valued numbers. Indeed, functional messages A take only 2 different values: ^^-^^(/i), 
that from now on will be called /z) to avoid index redundancy, and A^_>i,(c : c ^ fi) = A^^^, independently 
on c, as long as it is ^ fi. The exchange of indexes in A{fj,, v) is pure convention, but it has been introduced for 
coherency with the definition of availabilities given in [l]. It follows immediately from the normalization condition 
that i^^, = (1 - A{v, ^j.))/{N - 1). 

For the cavity probability functions, manipulation of Eq. (|16p involving the use of the last normalization condition 
and of the normalization constant rescaling, leads to 

{^J. = iy) A (c = ^) -> -Bp^p(Ai) = = 

(pi = i^) A (c ^ ^) ^ Bf,^^{c -.c^fi)^ 

1 (TV - 1)A(/^, c)e''^(^''=) 
" zfT^ l-A{fi,c) 

(18) 

ifi 1^) A {c ^ fi) ^ B,,^,{v) = R{fi, v) = — (19) 



(,l3S(tJ.,v) 



in^v) A (cy^v) B^^^{c: c^ v) ^ 

_ 1 (A^ - l)A(/x, c)e^'g(^-") 
1-A(m,c) 

with Z^_,^j guaranteeing normalization B^^ij{c) — 1. Messages B^^y{v) -R(/i, i^) have been also renamed with 
a symbol coherent with the responsibility-availability notation of [H. It can be seen that self consistent equations 
close into the 0{N'^) quantities A{l', n) and R{ijl,v) alone. Indeed, the effective dependence on the exemplar choice 
is dropped, and the computational size of the problem reduces by a factor TV. The set of equations of A and R can 
be solved iteratively via the BP algorithm. The case in which one is interested not in the whole form of the posterior 
probability function, but only in retaining information about the most probable exemplar chosen by each data point. 
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/ 



S(H,c) 

FIG. 4: SCAP factor graph and direction of sent messages. Variables are represented by circles, constraints by squares. 

□ ■■■■ □ □ □ 





FIG. 5: Pictorial representation of messages flow. 



can be seen to be equivalent to taking the /? oo limit where availabilities a and responsibilities r are introduced in 
the following way: 



V — '/I 



A 



(20) 
(21) 
(22) 
(23) 



and treating the exponential scaling in a regime where prior similarities between data points S are of the same order 
of magnitude of the valued of a and r. The rescaling of responsibilities can be freely done as it does not change the 
number of independent variables. From the last definitions one is led to equations 



1 



1 + {N - l)e-P<^(M^'') 
1 



1 -|- g-/3r(^,iy) 



(24) 
(25) 



where the last relation already assumes a large-/? limit with non-degeneracy of the most probable value of the cavity 
probabilities. This hypothesis is equivalent to having non-degenerate choices of exemplars for all data points, i.e., 
to the existence of a single optimal clustering identified via the SCAP algorithm as the unique ground state of the 
system energy This is a sensible assumption, but it is not always satisfied in interesting cases. Studying the 
degenerate number and behavior of clustering choices is another crucial question that is only partially answered by 
the introduction of the relaxation parameter p and will be the subject of further work beyond this paper. In the large 
/3 regime, in order to work with quantities all with the same scaling, it is useful to define 



(26) 



and consider p, a and r fixed varying /3. Equating Eqs. ((24| and ((25|) with Eqs. p?|) and p9|) respectively, and 
extracting the leading terms in the large (3 limit assuming no degeneracy, the following equations are found, using 
Eq. (EH): 
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a[^^v) = mill 0, max(— min(0, r(;^, v))) + / ^ max(0, r{X^v) 



(27) 




Making another change of variables redefining the self-responsibiUties as 



max^{r(^,//)} ^ r(^,//) 



(28) 



we get, in terms of the rescaled quantities, 



max(— p, min(0, r(/i, /^))) = r(/i, /j,) 



(29) 



leading to SCAP equations ([5]) [IT]. After convergence, marginals can be written [H, 



as 



^m(c) - P^^^(c) cx[]AA^^(c)exp(/35(/i,c)) 



A 



(30) 



In the /? — > oo limit, one can write equation ([6|). 



V. DISCUSSION 



Affinity Propagation is a new powerful tool for unsupervised clustering. It has many very strong points. First 
it is very efficient, convergence to the final clustering is very fast, the latter appears to be independent on the 
initialization of messages. Second, due to its hard constraints AP identifies exemplars which are prototypical data 
points representing a whole cluster. 

This last point is, however, also a first limitation of the original AP algorithm. If clusters cannot be well-represented 
by a single cluster exemplar, AP has to fail. The hard constraint renders the algorithm greedy, and small fiuctuations 
in the similarity measure may trigger avalanches in the exemplar choice leading to different clusterings for only slightly 
modified model parameters. 

We have introduced a soft-constraint version of affinity propagation which is able to cure a part of these problems 
without loosing the efficiency of the original AP: 

• By relaxing the hard constraint on clusters exemplars, we could introduce a parameter (p) controlling the 
algorithm greediness, p is a better tuning parameter than a (it is more informative and leads to more robust 
and stable clustering) and it is easier to interpret the statistical meaning of its tuning process. 

• Clusters are more robust than in the original formulation of the algorithm. Moreover, even though a second a 
priori free-parameter is introduced, the overall dependence of the algorithm on free parameters is reduced, and 
an optimal tuning strategy naturally emerges. 

• The cluster structure can be efficiently probed. This concerns the internal structure of the clusters since SCAP 
is able to identify central and peripherical nodes of each clusters, as well as the hierarchical organization leading 
to a process of cluster merging if cluster number is reduced by looking to less fine structures. 

• In the case of high-dimensional data, the relation between data points and their exemplars can be used to 
extract a sparse cluster signature. In the case of brain tumors, we have found that 20-40 genes per cluster are 
sufficient to reproduce almost the same clustering as found using all genes. 

We conclude that SCAP is more efficient than AP in particular in the case of noisy, irregularly organized data - 
and thus in biological applications concerning microarray data. The computational efficiency of SCAP allows there 
to treat also very large data sets. 
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